chore: resync API dumps to current develop source #647
Merged
Conversation
The committed binary-compatibility-validator baselines had drifted from the current source — e.g. `Q8_0BlockTensorData` implements `PackedBlockStorage` in code but the dump didn't reflect it, and several `ExecutionContext` accessors (`memoryPlanner`, `scratch`, `wrapByteArray`, …) were never re-dumped. Regenerated via `./gradlew apiDump` with no source changes, so `apiCheck` is green again repo-wide. No public API changes here — purely a baseline resync (1737 additions across 6 modules) so subsequent feature PRs show only their own deltas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Promotes Q4_0 (older GGML 4-bit, 18 bytes / 32 elements) from a JVM/MemSegment-only side-path to a first-class quantized format that any loader can produce and any backend can specialize, mirroring Q8_0:

- commonMain `Q4_0TensorData` interface + `Q4_0BlockTensorData` (heap, ByteArray-backed) with `toFloatArray()` dequant and `PackedBlockStorage`.
- `TensorEncoding.Q4_0` (32 elems / 18 bytes).
- `Q4_0MatmulKernel` SPI + `KernelProvider.matmulQ4_0()` (default null) and a `"Q4_0"` case in `supports()`.
- `ScalarQ4_0MatmulKernel` (portable commonMain floor) wired through `ScalarKernelProvider`.
- `DefaultCpuOpsJvm`: lazy `q4_0MatmulKernel` resolved via `KernelRegistry` + an `is Q4_0TensorData ->` branch in `chooseQuantizedMatmul`.

Uses the canonical ggml *split* nibble layout (low nibbles → elements 0..15, high → 16..31, `(code - 8) * d`) matching `DequantOps.dequantQ4_0FromBytes`, NOT the interleaved layout the existing JVM MemSeg `dotQ4_0BlockMemSeg` uses (that mismatch is the likely reason the Q4_0 MemSeg path was never exercised; PR2 reconciles it).

Tests: `Q4_0TensorDataTest` (layout/dequant), `Q4_0MatmulDispatchTest` (scalar==dispatch), `KernelProviderSupportsTest` extended for Q4_0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
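For readers unfamiliar with the format, the split-layout decode described above can be sketched in C as follows. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and the block is assumed to store the FP16 scale first (little-endian), followed by the 16 packed code bytes, as in ggml's Q4_0 block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define Q4_0_BLOCK_ELEMS 32
#define Q4_0_BLOCK_BYTES 18 /* 2-byte FP16 scale + 16 packed code bytes */

/* Convert an IEEE 754 binary16 bit pattern to float. */
static float fp16_to_f32(uint16_t h) {
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    float sign = (h & 0x8000u) ? -1.0f : 1.0f;
    if (exp == 0) {
        /* Zero or subnormal: value is man * 2^-24. */
        return sign * (float)man * (1.0f / 16777216.0f);
    }
    /* Inf/NaN keep an all-ones exponent; normals are rebiased from 15 to 127. */
    uint32_t bits = (exp == 0x1Fu ? 0x7F800000u : (exp + 112u) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return sign * f;
}

/* Dequantize one Q4_0 block using the split nibble layout:
 * low nibble of byte j -> element j, high nibble of byte j -> element j + 16. */
static void dequant_q4_0_block(const uint8_t block[Q4_0_BLOCK_BYTES],
                               float out[Q4_0_BLOCK_ELEMS]) {
    assert(block != NULL && out != NULL);
    /* Assumption: scale is stored little-endian in the first two bytes. */
    float d = fp16_to_f32((uint16_t)(block[0] | (block[1] << 8)));
    const uint8_t *qs = block + 2;
    for (int j = 0; j < 16; j++) {
        int lo = qs[j] & 0x0F;
        int hi = qs[j] >> 4;
        out[j]      = (float)(lo - 8) * d; /* (code - 8) * d */
        out[j + 16] = (float)(hi - 8) * d;
    }
}
```

With a scale of 1.0 (FP16 bits `0x3C00`) and a first code byte of `0xF0`, element 0 decodes to -8 and element 16 decodes to 7.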
|
📖 Documentation Preview The documentation has been built successfully for this PR. Generated Files:
Artifacts:
This comment will be updated automatically when the PR is updated. |
Adds `PanamaVectorQ4_0MatmulKernel` (JDK Vector API): per block, decode the FP16 scale, unpack the 16 code bytes into 32 sign-corrected floats in the canonical ggml split layout, then SIMD-FMA against the input window. Wired through `PanamaVectorKernelProvider.matmulQ4_0()` (priority 50), so `DefaultCpuOpsJvm`'s `q4_0MatmulKernel` now prefers it over the scalar floor on JDK 21+.

Also fixes a latent layout bug: the existing JVM MemSegment Q4_0 path (`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg` and `Q4MemorySegmentTensorData` get/set/copyToFloatArray) used an *interleaved* nibble layout (code[2k]/[2k+1] from byte k), which does NOT match real GGUF Q4_0 weights (split layout: low nibbles → 0..15, high → 16..31, per `DequantOps.dequantQ4_0FromBytes`). This mismatch is the likely reason the Q4_0 MemSeg path was never exercised end-to-end. All three sites + the test encoder are reconciled to the split layout, so the MemSeg path now agrees with the heap `Q4_0BlockTensorData`, the scalar/Panama SPI kernels, and canonical ggml.

Tests: `PanamaVectorQ4_0MatmulKernelParityTest` (scalar≈panama within FMA tolerance), `QuantizedMemSegMatmulTest` still green under split layout. `apiCheck` green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
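To make the layout bug concrete, the sketch below unpacks the same 16 code bytes under both layouts and counts how many of the 32 element positions disagree. It is illustrative only: the function names are hypothetical, and the interleaved variant assumes the low nibble of byte k maps to element 2k and the high nibble to 2k+1, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Split (canonical ggml Q4_0): byte k holds elements k (low) and k + 16 (high). */
static void unpack_q4_0_split(const uint8_t qs[16], int8_t codes[32]) {
    assert(qs && codes);
    for (int k = 0; k < 16; k++) {
        codes[k]      = (int8_t)((qs[k] & 0x0F) - 8);
        codes[k + 16] = (int8_t)((qs[k] >> 4) - 8);
    }
}

/* Interleaved (what the old MemSeg path assumed): byte k holds elements 2k and 2k + 1.
 * Assumption: low nibble -> 2k, high nibble -> 2k + 1. */
static void unpack_q4_0_interleaved(const uint8_t qs[16], int8_t codes[32]) {
    assert(qs && codes);
    for (int k = 0; k < 16; k++) {
        codes[2 * k]     = (int8_t)((qs[k] & 0x0F) - 8);
        codes[2 * k + 1] = (int8_t)((qs[k] >> 4) - 8);
    }
}

/* Number of element positions where the two layouts decode different codes. */
static int count_layout_mismatches(const uint8_t qs[16]) {
    int8_t split[32], inter[32];
    int mismatches = 0;
    unpack_q4_0_split(qs, split);
    unpack_q4_0_interleaved(qs, inter);
    for (int i = 0; i < 32; i++) {
        if (split[i] != inter[i]) mismatches++;
    }
    return mismatches;
}
```

Only positions 0 and 31 are guaranteed to agree, since both layouts read them from the same nibble; every other position depends on the data, so real GGUF weights read through the interleaved path decode to a scrambled block.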
Completes the Q4_0 kernel stack with a hand-written C kernel at priority 100. Adds native/src/q4_0_matmul.c (split-layout `(code - 8) * d` decode, tight auto-vectorizing inner loop mirroring q8_0_matmul.c), declares skainet_q4_0_matmul in skainet_kernels.h, and adds it to CMakeLists.

Kotlin side: NativeQ4_0MatmulKernel (FFM downcall, mirrors NativeQ8_0MatmulKernel) wired through NativeKernelProvider.matmulQ4_0(). With the bundled libskainet_kernels loaded, KernelRegistry.bestAvailable() now prefers native → Panama → scalar for Q4_0, same cascade as Q8_0/Q4_K.

Verified locally (cmake build): NativeQ4_0MatmulKernelParityTest passes, with native output matching PanamaVectorQ4_0MatmulKernel within FMA tolerance across matvec / attention / FFN shapes. CI without the native lib stays green via the same availability gate the other native parity tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
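The shape of such a kernel can be sketched as a plain scalar matrix-vector product over packed Q4_0 rows. This is not the contents of q4_0_matmul.c and does not reflect the real `skainet_q4_0_matmul` signature: the function name, argument order, and row-major packing of 18-byte blocks are assumptions made for illustration. The scale is applied once per block after the 32-element inner sum, which keeps the inner loop simple enough for a compiler to vectorize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 binary16 bit pattern to float. */
static float fp16_to_f32(uint16_t h) {
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    float sign = (h & 0x8000u) ? -1.0f : 1.0f;
    if (exp == 0) return sign * (float)man * (1.0f / 16777216.0f); /* zero/subnormal */
    uint32_t bits = (exp == 0x1Fu ? 0x7F800000u : (exp + 112u) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return sign * f;
}

/* Hypothetical sketch: y[r] = dot(dequant(row r of w), x).
 * w holds rows * (k / 32) packed blocks of 18 bytes each, row-major (assumed). */
static void q4_0_matvec_sketch(const uint8_t *w, const float *x, float *y,
                               int rows, int k) {
    assert(k % 32 == 0);
    int blocks_per_row = k / 32;
    for (int r = 0; r < rows; r++) {
        const uint8_t *row = w + (size_t)r * (size_t)blocks_per_row * 18;
        float acc = 0.0f;
        for (int b = 0; b < blocks_per_row; b++) {
            const uint8_t *blk = row + (size_t)b * 18;
            /* Assumption: little-endian FP16 scale, then 16 code bytes. */
            float d = fp16_to_f32((uint16_t)(blk[0] | (blk[1] << 8)));
            const uint8_t *qs = blk + 2;
            const float *xb = x + (size_t)b * 32;
            float s = 0.0f;
            for (int j = 0; j < 16; j++) {
                /* Split layout: low nibble -> j, high nibble -> j + 16. */
                s += (float)((qs[j] & 0x0F) - 8) * xb[j];
                s += (float)((qs[j] >> 4) - 8) * xb[j + 16];
            }
            acc += d * s; /* scale applied once per block */
        }
        y[r] = acc;
    }
}
```

Because the scale is factored out of the inner sum, results can differ from an element-wise dequantize-then-multiply in the last bits, which is consistent with the parity tests comparing kernels within a tolerance rather than bit-for-bit.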
feat(q4_0): native FFM kernel (skainet_q4_0_matmul)
feat(q4_0): first-class Q4_0 core format + scalar kernel + SPI
feat(q4_0): Panama SIMD kernel + reconcile MemSeg to split layout
What
Regenerates the binary-compatibility-validator `.api` baselines via `./gradlew apiDump` with no source changes. The committed dumps had drifted from current source, so `apiCheck` was effectively failing/unenforced.

Evidence of drift (examples):
- `Q8_0BlockTensorData` implements `PackedBlockStorage` in source, but the committed dump omits it.
- `ExecutionContext` accessors (`memoryPlanner`, `memoryTracker`, `scratch`, `wrapByteArray`/`FloatArray`/`IntArray`, `placeholder`) were never re-dumped.

Why
This unblocks `apiCheck` repo-wide and lets the upcoming Q4_0 feature PRs show only their own ~20-line API deltas instead of mixing in ~1700 lines of unrelated baseline churn.

Scope
Pure mechanical resync: 1737 additions / 49 deletions across 6 modules (`skainet-lang-core`, `skainet-backend-cpu`, `skainet-compile-{dag,hlo,opt}`, `skainet-lang-dag`). No `.kt` changes. No new public API.

🤖 Generated with Claude Code